
    Load Balancing Analysis of a Parallel Hierarchical Algorithm on the Origin2000

    Conference with proceedings, without peer review.
    The ccNUMA architecture of the SGI Origin2000 has been shown to perform and scale well for a wide range of scientific and engineering applications. This paper focuses on a well-known hierarchical computer graphics algorithm, wavelet radiosity, whose parallelization is made challenging by its irregular, dynamic and unpredictable behavior. Our previous experiments, based on a naive parallelization, showed that the Origin2000 hierarchical memory structure was well suited to the natural data locality exhibited by this hierarchical algorithm. However, our crude load balancing strategy was clearly insufficient to exploit the full power of the Origin2000. We present here a fine-grained load balancing analysis and then propose several enhancements, namely "lazy copy" and "lure", that greatly reduce the idle time spent in locks and synchronization barriers. The new parallel algorithm is evaluated on a 64-processor Origin2000. Although these enhancements introduce a theoretical communication overhead, we show that data locality is still preserved. The final performance evaluation shows quasi-optimal behavior, at least up to the 32-processor scale; beyond that, a trouble spot remains to be identified to explain the performance degradation observed at 64 processors.

    Shift-Based Parallel Image Compositing on InfiniBand Fat-Trees

    International audience.
    Parallel image compositing has been widely studied over the past 20 years, as it is one of the most crucial elements in the implementation of a scalable parallel rendering system. Many algorithms have been proposed and implemented on a large variety of supercomputers. Among existing supercomputers, InfiniBand (IB) PC clusters, with their associated fat-tree topology, are clearly becoming the dominant architecture, as they provide the scalability, high bandwidth and low latency required by the most demanding parallel applications. Surprisingly, very few efforts have been devoted to the implementation and performance evaluation of parallel image compositing algorithms on this kind of architecture. We propose in this paper a new parallel image compositing algorithm, called Shift-Based, relying on a well-known communication pattern called shift permutation. Indeed, shift permutation is one of the ways to obtain the maximum cross-bisectional bandwidth provided by an IB fat-tree cluster. We show that our Shift-Based algorithm scales to any number of processing nodes (with peak performance at specific node counts), allows overlapping communications with computations, and exhibits contention-free network communications. This is demonstrated with the compositing of very high resolution images at interactive frame rates.
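The shift-permutation pattern the abstract relies on can be illustrated with a short sketch (illustrative only, not the paper's code): in round r, node i sends its partial image to node (i + r) mod n, so every round is a permutation in which each node sends and receives exactly one message.

```python
# Sketch of a shift permutation on n nodes (illustrative, not the
# paper's implementation). In round r, node i sends to (i + r) % n.

def shift_round(n, r):
    """Return the sender -> receiver mapping for shift distance r."""
    return {i: (i + r) % n for i in range(n)}

def is_permutation(mapping):
    """A round is contention-free at the endpoints iff every node
    receives exactly one message."""
    return sorted(mapping.values()) == sorted(mapping.keys())

n = 8
for r in range(1, n):
    assert is_permutation(shift_round(n, r))
print(shift_round(n, 1))  # node 0 -> 1, 1 -> 2, ..., 7 -> 0
```

Endpoint contention-freedom alone does not guarantee link contention-freedom inside a fat-tree; the paper's claim is that shift permutations are among the patterns an IB fat-tree can route at full cross-bisectional bandwidth.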

    Pipelined Sort-last Rendering: Scalability, Performance and Beyond

    We present in this paper a theoretical and practical performance analysis of pipelined sort-last rendering for both polygonal and volume rendering. Theoretical peak performance and scalability are studied, exhibiting maximum attainable frame rates of 19 fps (volume rendering with back-to-front alpha blending) and 11 fps (polygonal rendering with Z-buffer compositing) for a 1280×1024 display on a Gigabit Ethernet cluster. We show that our implementation of pipelined sort-last rendering on a 17-node PC cluster can nearly sustain these theoretical figures. We finally propose possible enhancements that would allow going beyond the maximum theoretical limits. This paper clearly shows the potential of pipelined sort-last rendering for real-time visualization of very large models on standard PC clusters.
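A back-of-envelope calculation shows where limits of this order come from (assumptions mine, not the paper's exact model): if each node must ship one full-resolution image per frame over a Gigabit Ethernet link carrying roughly 125 MB/s of payload, the link bounds the frame rate.

```python
# Bandwidth-bound frame-rate estimate (assumed model, not the
# paper's exact derivation): one full image per frame per link.

LINK_BPS = 125e6             # ~1 Gbit/s of payload bandwidth
W, H = 1280, 1024            # display resolution from the abstract

def max_fps(bytes_per_pixel):
    """Upper bound on frame rate when the link is the bottleneck."""
    frame_bytes = W * H * bytes_per_pixel
    return LINK_BPS / frame_bytes

# RGBA colour only (4 B/px), as in back-to-front alpha blending:
fps_blend = max_fps(4)
# colour + 32-bit depth (8 B/px), as in Z-buffer compositing:
fps_zbuf = max_fps(8)
print(round(fps_blend, 1), round(fps_zbuf, 1))  # 23.8 11.9
```

The Z-buffer bound (~12 fps) lands close to the 11 fps the paper reports; the blending bound is looser, since the paper's pipelined model accounts for per-stage costs this sketch ignores.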

    Interactive Poster: Visualizing the Interaction Between Two Proteins

    International conference with proceedings, peer-reviewed. International audience.
    Protein docking is a fundamental biological process that links two proteins in order to change their properties. The link is defined by a set of forces between two large areas of the protein boundaries. Two docked proteins are very close to each other due to the van der Waals (VdW) forces. This makes the phenomenon difficult to understand using classical molecular visualization. We present a way to focus on the most interesting area: the interface between the proteins. Visualizing the interface is useful both to understand the process through co-crystallized proteins and to estimate the quality of docking simulation results. The interface may be defined by a surface that separates the two proteins. The geometry of the surface is induced by the VdW forces, while other forces can be represented by attributes mapped onto the surface. We present a very fast algorithm that extracts the interface surface. Moreover, the result of a rigid docking simulation can be improved using the flexibility of the residues. We show how the interface surface geometry and attributes can be updated in real time when the user interactively moves the residues. This way, we allow expert knowledge to be introduced intuitively into the process to enhance the quality of the docking.

    COTS Cluster-based Sort-last Rendering: Performance Evaluation and Pipelined Implementation

    Sort-last parallel rendering is an efficient technique to visualize huge datasets on COTS clusters. The dataset is subdivided and distributed across the cluster nodes. For every frame, each node renders a full resolution image of its data using its local GPU, and the images are composited together using a parallel image compositing algorithm. In this paper, we present a performance evaluation of standard sort-last parallel rendering methods and of the different improvements proposed in the literature. This evaluation is based on a detailed analysis of the different hardware and software components. We present a new implementation of sort-last rendering that fully overlaps CPU(s), GPU and network usage throughout the algorithm. We present experiments on a three-year-old 32-node PC cluster and on a 1.5-year-old 5-node PC cluster, both with Gigabit interconnect, showing volume rendering at respectively 13 and 31 frames per second and polygon rendering at respectively 8 and 17 frames per second on a 1024×768 render area, and we show that our implementation outperforms or equals many other implementations and specialized visualization clusters.
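The per-pixel operation at the heart of sort-last volume compositing is the back-to-front "over" operator. A minimal sketch (illustrative, not the paper's code, with premultiplied-alpha pixels assumed):

```python
# Back-to-front "over" compositing of partial images, per pixel
# (illustrative sketch). Pixels are (r, g, b, a) tuples with
# premultiplied alpha in [0, 1].

def over(front, back):
    """Composite one premultiplied-alpha pixel over another."""
    fr, fg, fb, fa = front
    br, bg, bb, ba = back
    k = 1.0 - fa
    return (fr + k * br, fg + k * bg, fb + k * bb, fa + k * ba)

def composite_back_to_front(layers):
    """Fold a back-to-front list of pixel layers into one pixel."""
    result = layers[0]
    for layer in layers[1:]:
        result = over(layer, result)
    return result

# A fully opaque front layer hides everything behind it:
opaque_red = (1.0, 0.0, 0.0, 1.0)
blue = (0.0, 0.0, 0.5, 0.5)
assert composite_back_to_front([blue, opaque_red]) == opaque_red
```

Because "over" is associative but not commutative, the compositing algorithm must preserve the depth order of the partial images, which is what distinguishes volume compositing from the order-independent Z-buffer compositing used for polygons.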

    High Performance Computing and Visualization for Forestry Applications - Project SILVES

    International conference with proceedings, peer-reviewed. International audience.
    Modelling, simulation and visualisation of forest cover are very important goals for silviculture. The development of techniques for tree and forest modelling and for GIS is now sufficient to build specific software for forest simulation, and better ways to represent and interact with large models such as forests are topics of high interest. After a short presentation of the SILVES project, we present in this paper the design and implementation of a visualisation program with two main features. It is based on the central data model, in order to display in real time the modifications made by the modeler. Furthermore, it benefits from different immersive environments, which give the user a much more accurate insight into the model than a regular computer screen. We then focus on the difficulties that stand in the way of performance.

    Experimentation of Data Locality Performance for a Parallel Hierarchical Algorithm on the Origin2000

    Conference with proceedings, peer-reviewed.
    Hierarchical algorithms form a class of applications widely used in high-performance scientific computing, thanks to their ability to solve very large physical problems. They are based on the physical property that the further apart two points are, the less they influence each other. However, their irregular and dynamic characteristics make parallelizing them efficiently a challenge. Indeed, two conflicting objectives have to be balanced: load balancing and data locality. It has been shown that the message passing paradigm is not well suited to this kind of application, because of the intensive communication it introduces. Implicit communication through a shared address space appears better adapted. In particular, the ccNUMA architecture of the Origin2000 can provide the desired data locality through its memory hierarchy. We have experimented with a parallel implementation of a well-known computer graphics hierarchical algorithm: wavelet radiosity. This algorithm is a very efficient approach to computing global illumination in diffuse environments, but it remains too time- and memory-consuming when dealing with extremely complex models. Our parallel algorithm focuses on load balancing optimization and relies heavily on the efficiency of the ccNUMA architecture for data locality. Load balancing is handled with a general dynamic tasking mechanism with specific improvements. Minimal effort is devoted to memory management (such as writing thread-safe, non-blocking malloc/free C routines), and the Origin2000 proves fully capable of efficiently handling the natural data locality of our application. Our best results yield a speed-up of 24 with 36 processors. Moreover, we were able to compute the illumination of a complex scene (a cloister in Quito, composed of 54,789 initial surfaces and leading to 600,000 final meshes) in 2 hours 41 minutes with 24 processors. To the authors' knowledge, this is the most complex "real world" scene ever computed.
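The reported speed-up of 24 on 36 processors (a parallel efficiency of 2/3) can be put in perspective with a quick Amdahl's-law calculation (illustrative arithmetic only; the paper does not claim an Amdahl model):

```python
# Amdahl's law: S(p) = 1 / (s + (1 - s) / p), where s is the serial
# fraction. A speed-up of 24 on 36 processors corresponds to a serial
# fraction of roughly 1.4% under this (assumed) model.

def amdahl_speedup(serial_fraction, p):
    """Speed-up on p processors given a fixed serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

def serial_fraction_from(speedup, p):
    """Invert Amdahl's law: s = (p / S - 1) / (p - 1)."""
    return (p / speedup - 1.0) / (p - 1.0)

s = serial_fraction_from(24, 36)
print(round(s, 4))                        # 0.0143
assert abs(amdahl_speedup(s, 36) - 24) < 1e-9
```

In practice the efficiency loss also reflects load imbalance and synchronization, which Amdahl's fixed-serial-fraction model lumps into a single term.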

    Parallel Wavelet Radiosity

    Conference with proceedings, peer-reviewed.
    This paper presents parallel versions of a wavelet radiosity algorithm. Wavelet radiosity is based on a general framework of projection methods and wavelet theory. The resulting algorithm has a cost proportional to O(n), versus the O(n^2) complexity of classical radiosity algorithms. However, designing a parallel wavelet radiosity algorithm is challenging because of its irregular and dynamic nature. Since explicit message passing approaches fail to deal with such applications, we have experimented with various parallel implementations on a hardware ccNUMA architecture, the SGI Origin2000. Our experiments show that load balancing is a crucial performance issue in handling the dynamic distribution of work and communication, while all reasonable efforts are made to exploit data locality efficiently. Our best results yield a speed-up of 24 with 36 processors, even when dealing with extremely complex models.

    Overlapping Multi-Processing and Graphics Hardware Acceleration: Performance Evaluation

    Conference with proceedings, peer-reviewed.
    Recently, multi-processing has been shown to deliver good rendering performance. However, in some applications, processors spend too much time executing tasks that could be done more efficiently through intensive use of new graphics hardware. We present in this paper a novel solution combining multi-processing and advanced graphics hardware, where graphics pipelines are used both for classical visualization tasks and to advantageously perform geometric calculations, while the remaining computations are handled by the multi-processors. The experiment is based on an implementation of a new parallel wavelet radiosity algorithm. The application is executed on an SGI Origin2000 connected to an SGI InfiniteReality2 rendering pipeline. A performance evaluation is presented. Keeping in mind that the approach can benefit all available workstations and supercomputers, from small scale (2 processors and 1 graphics pipeline) to large scale (p processors and n graphics pipelines), we highlight some important bottlenecks that impede performance. Nevertheless, our results show that this approach could be a promising avenue for scientific and engineering simulation and visualization applications that need intensive geometric calculations.

    Distributed Shared Memory for Roaming Large Volumes

    We present a cluster-based volume rendering system for roaming very large volumes. This system allows moving a gigabyte-sized probe inside a total volume of several tens or hundreds of gigabytes in real time. While the size of the probe is limited by the total amount of texture memory on the cluster, the size of the total dataset has no theoretical limit. The cluster is used as a distributed graphics processing unit that aggregates both graphics power and graphics memory. A hardware-accelerated volume renderer runs in parallel on the cluster nodes and the final image compositing is implemented using a pipelined sort-last rendering algorithm. Meanwhile, volume bricking and volume paging allow efficient data caching. On each rendering node, a distributed hierarchical cache system implements a global software-based distributed shared memory on the cluster. In case of a cache miss, this system first checks page residency on the other cluster nodes instead of directly accessing local disks. Using two Gigabit Ethernet network interfaces per node, we accelerate data fetching by a factor of 4 compared to directly accessing local disks. The system also implements asynchronous disk access and texture loading, which makes it possible to overlap data loading, volume slicing and rendering for optimal volume roaming.
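The hierarchical lookup order described above can be sketched as follows (hypothetical names and data structures, not the system's API): on a cache miss, peers are queried for the brick before falling back to the local disk, since remote memory over the dual Gigabit links is reported to be about 4x faster than disk.

```python
# Sketch of the miss-handling path of a distributed brick cache
# (illustrative; class and field names are assumptions).

class BrickCache:
    def __init__(self, local, peers, disk):
        self.local = local   # dict: brick id -> data, this node's RAM
        self.peers = peers   # list of dicts, other nodes' caches
        self.disk = disk     # dict standing in for local disk storage

    def fetch(self, brick_id):
        # 1. Local memory hit: cheapest path.
        if brick_id in self.local:
            return self.local[brick_id], "local"
        # 2. Remote memory: check page residency on peer nodes first.
        for peer in self.peers:
            if brick_id in peer:
                data = peer[brick_id]
                self.local[brick_id] = data   # cache for next access
                return data, "peer"
        # 3. Last resort: local disk.
        data = self.disk[brick_id]
        self.local[brick_id] = data
        return data, "disk"

cache = BrickCache(local={}, peers=[{"b1": b"x"}], disk={"b2": b"y"})
assert cache.fetch("b1")[1] == "peer"
assert cache.fetch("b1")[1] == "local"   # now resident in local RAM
assert cache.fetch("b2")[1] == "disk"
```

The real system additionally overlaps these fetches with slicing and rendering via asynchronous I/O, which a synchronous sketch like this one omits.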